Abstract. One initial goal for the DRMF is to seed our digital compendium
with fundamental orthogonal polynomial formulae. We used the data from the
NIST Digital Library of Mathematical Functions (DLMF) as the initial seed
for our DRMF project. The DLMF input LaTeX source already contains some
semantic information encoded using a highly customized set of semantic
LaTeX macros. Those macros could be converted to content MathML using
LaTeXML. During that conversion the semantics were translated to an
implicit DLMF content dictionary. This year, we have developed a semantic
enrichment process whose goal is to infer semantic information from generic
LaTeX sources.

The generated context-free semantic information is used to build DRMF
formula home pages for each individual formula. We demonstrate this
process using selected chapters from the book “Hypergeometric Orthogonal
Polynomials and their q-Analogues” (2010) by Koekoek, Lesky and
Swarttouw (KLS) as well as an actively maintained addendum to this
book by Koornwinder (KLSadd). The generic input KLS and KLSadd LaTeX
sources describe the printed representation of the formulae, but do not
contain explicit semantic information. See http://drmf.wmflabs.org.

1 Introduction 

Formula home pages are the principal conceptual objects for the DRMF project.
These should contain the full context-free semantic information concerning
individual orthogonal polynomial and special function (OPSF) formulae. The DRMF
is designed for a mathematically literate audience and should (1) facilitate
interaction among a community of mathematicians and scientists interested in
compendia formulae data for orthogonal polynomials and special functions; (2)
be expandable, allowing the input of new formulae from the literature; (3)
represent the context-free full semantic information concerning individual
formulae; (4) have a user-friendly, consistent, and hyperlinkable viewpoint and
authoring perspective; (5) contain easily searchable mathematics; and (6) take advantage

* The final publication is available at http://link.springer.com. 
Fig. 1. Data flow of seeding projects. For most of the input LaTeX source distributions,
DLMF and DRMF macros are not incorporated. For the DLMF LaTeX source, the
DLMF macros are already incorporated.


of modern MathML tools for easy-to-read, scalably rendered, content-driven
mathematics. In this paper we discuss the DRMF seeding projects whose
goal is to import data, for example, from traditional print media (cf. Figure 1).

We are investigating various sources for seed material in the DRMF [3]. We
have been given permission to use a variety of input resources to generate our
online compendium of mathematical formulae. The current sources that we are
incorporating into the DRMF are as follows: (1) the NIST Digital Library of
Mathematical Functions (DLMF) [1, 6]; (2) Chapters 1, 9, and 14 (a total of 228
pages with about 1800 formulae) from the Springer-Verlag book “Hypergeometric
Orthogonal Polynomials and their q-Analogues” (2010) by Koekoek, Lesky
and Swarttouw (KLS) [7]; (3) Tom Koornwinder’s Additions to the formula lists
in “Hypergeometric orthogonal polynomials and their q-Analogues” by Koekoek,
Lesky and Swarttouw (KLSadd) [10]; (4) the Wolfram Computational Knowledge of
Continued Fractions Project (eCF); and (5) the Bateman Manuscript Project (BMP)
[ ] (see Table 1). Note that the DLMF, KLS, KLSadd, and eCF datasets are
currently being processed within our pipeline. For the BMP dataset, we have
furnished high-quality print scans to Alan Sexton and are currently waiting on
the math-OCR-generated LaTeX output for this dataset, which is still being
produced. In this paper we focus on DRMF seeding of generic LaTeX sources,
namely those which do not contain explicit semantic information.

2 Seeding with Generic LaTeX Sources

DRMF seeding projects collect and stream OPSF mathematical formulae into
formula pages. Formula pages are classified into those which list formulae in a
broad category, and the individual formula home pages for each formula.
Generated formula home pages are required to contain bibliographic information and
usually contain a list of symbols, substitutions and constraints required by the
formulae, proofs and formula names if available, as well as related notes. Every
semantic formula entity (e.g., function, polynomial, sequence, operator, constant
or set) has a unique name and a link to its definition or description.

We use the typewriter font in this document to refer to our seeding datasets. 
Table 1. Overview of the first three stages of the DRMF project. Note that the numbers
which are given are rough estimates.

                 Stage 1                 Stage 2            Stage 3

Started in       2013                    2014               2015

Dataset          DLMF,                   KLS,               eCF: Mathematica
                 semantic LaTeX          plain LaTeX        BMP: book images

Semantic         identify constraints,   add new            image recognition
enrichment       substitutions,          semantic macros    macro suggestion
                 notes, names,
                 proofs, ...

Technologies     manual review,          improved rules     natural language
                 rule-based                                 processing and
                 approaches                                 machine learning

[...] 5000 [...]

Contribution     gold standard           gold standard      evaluation
                 for constraint          for macro          metrics
                 and proof detection     replacement


For LaTeX sources which are extracted from the DLMF project, the semantic
macros are already incorporated [11]. However, for generic sources such as the
KLS dataset, the semantic macros need to be inserted in place of the
LaTeX source which represents the corresponding mathematical object.

Here we give representative examples for the trigonometric sine function,
gamma function, Jacobi polynomial and little q-Laguerre/Wall polynomials,
which are rendered respectively as sin z, Γ(z), P_n^(α,β)(x), and p_n(x;a|q).
These functions and orthogonal polynomials have LaTeX presentations given
respectively by \sin z, \Gamma(z), P_n^{(\alpha,\beta)}(x), and p_n(x;a|q). The
semantic representations for these functions and orthogonal polynomials are
given respectively by \sin@@{z}, \EulerGamma@{z},
\Jacobi{\alpha}{\beta}{n}@{x}, and \littleqLaguerre{n}@{x}{a}{q}. The arguments
before the @ or @@ symbols are parameters and the arguments after the @ or @@
symbols are in the domain of the functions and orthogonal polynomials. The
choice between the @ and @@ symbols indicates a specified difference in
presentation, such as whether or not the parentheses are included in our
trigonometric sine example. For the little q-Laguerre polynomials, one has
three arguments within parentheses. These three arguments are separated by a
semicolon and a vertical bar. Our macro replacement algorithm identifies these
polynomials, and then extracts the contents of each argument. Furthermore,
there are many ways in LaTeX to represent opening and closing parentheses, and
our algorithm identifies these. Also, since the vertical bar in LaTeX can be
represented by ‘|’ or ‘\mid’, we search for both of these patterns. Our
algorithm also searches for and removes all LaTeX white-space commands such as
those given by \, \! or \hspace{}. There are many other details involved in
making our search and replace work, which we will not mention here.
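
The replacement just described can be sketched roughly as follows for the
little q-Laguerre case. This is a minimal illustration and not the DRMF
implementation: the regular expressions, the list of spacing commands, and the
function names strip_spacing and replace_little_q_laguerre are assumptions made
for this example, and only simple, unnested arguments are handled.

```python
import re

# Spacing commands that carry no semantic content (an assumed, partial list).
SPACING = re.compile(r"\\[,!;:]|\\hspace\*?\{[^{}]*\}|\\q?quad\b")

# Presentation pattern p_n(x; a | q); the bar may be written '|' or '\mid'.
LITTLE_Q_LAGUERRE = re.compile(
    r"p_(?:\{(?P<nb>[^{}]+)\}|(?P<n>\w))"  # degree: p_n or p_{n+1}
    r"\s*\(\s*(?P<x>[^;()]+?)\s*;"          # variable before the semicolon
    r"\s*(?P<a>[^|()]+?)\s*(?:\||\\mid)"    # parameter before the bar
    r"\s*(?P<q>[^()]+?)\s*\)"               # base after the bar
)


def strip_spacing(tex: str) -> str:
    """Remove spacing commands such as \\, \\! and \\hspace{...}."""
    return SPACING.sub("", tex)


def replace_little_q_laguerre(tex: str) -> str:
    """Rewrite p_n(x;a|q) as the semantic macro \\littleqLaguerre{n}@{x}{a}{q}."""

    def to_macro(m):
        n = m.group("nb") or m.group("n")
        return "\\littleqLaguerre{%s}@{%s}{%s}{%s}" % (
            n, m.group("x"), m.group("a"), m.group("q"))

    # Spacing is stripped first so that it cannot hide the pattern.
    return LITTLE_Q_LAGUERRE.sub(to_macro, strip_spacing(tex))
```

For instance, both p_n(x; a \mid q) and p_{n+1}(x;\,a|q) are matched by this
sketch. A production rule would additionally need to confirm from context that
p_n really denotes the little q-Laguerre polynomial, since the same presentation
is used for other objects.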

3 KLS Seeding Project 

In this section we describe how we augment the input KLS LaTeX source in order
to generate formula pages (see Figure 1). We are developing software that
processes input LaTeX source to generate output LaTeX source with semantic
mathematical macros incorporated. The semantic LaTeX macros that we are using
(664 total, with 147 currently being used for the DRMF project) are being
developed by NIST for use in the DLMF and DRMF projects. Whenever possible, we
use the standardized definitions from the NIST Digital Library of Mathematical
Functions [6]. If the definitions are not available on the DLMF website, then
we link to definition pages in the DRMF with included symbols lists. One main
goal of this seeding project is to incorporate mathematical semantic
information directly into the LaTeX source. The advantage of incorporating this
information directly into the LaTeX source is that mathematicians are capable
of editing LaTeX, whereas human editing of MathML is not feasible. This
enriched information can be further modified by mathematicians using their
regular working environment.

For the 3 chapters of the KLS dataset plus the KLSadd dataset, a total
of 89 semantic macros were replaced a total of 3308 times. That is an average
of 1.84 macros replaced per formula. Note that the KLSadd dataset is actively
being maintained, and when a new version of it is published, we will
incorporate the new information into the DRMF in an automated fashion. This
average will increase when more algebraic substitution formulae are included
as formula metadata. The most common macro replacements are given as follows.
The macros for the cosine function, Racah polynomial, Pochhammer symbol,
q-hypergeometric function, Euler gamma function, and q-Pochhammer symbol were
converted a total number of times equal to 117, 205, 237, 266, and 659. Our
current conversions, which use a rule-based approach, can be quite complicated
because of the variety of combinations of LaTeX input for various OPSF objects.
In LaTeX there are many ways of representing parentheses, which are usually
used for function arguments. Also, there are many ways to represent spacing
delimiters, which can mostly be ignored as far as representing the common
semantic information for a mathematical function call. Our software
canonicalizes these additional meaningless degrees of freedom, generates
easy-to-read semantic LaTeX source, and improves the rendering. Developing
automatic software which performs macro replacements for OPSF functions in
LaTeX is a challenging task. Our current rule-based approach is highly
tailored to our specific KLS and KLSadd input LaTeX source.
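
As an illustration of what canonicalizing parentheses can involve, the
following sketch collapses sized delimiters such as \left( and \bigr) into plain
parentheses. The list of size prefixes and the function name
canonicalize_parentheses are assumptions made for this example; the actual
rules used for the KLS and KLSadd sources are not reproduced here.

```python
import re

# Size prefixes that may precede a delimiter (an assumed, partial list).
_SIZE = (r"\\(?:left|right|bigl|bigr|Bigl|Bigr|biggl|biggr|Biggl|Biggr"
         r"|bigg|Bigg|big|Big)")
_OPEN = re.compile(_SIZE + r"\s*\(")
_CLOSE = re.compile(_SIZE + r"\s*\)")


def canonicalize_parentheses(tex: str) -> str:
    """Replace sized parentheses such as \\left( or \\bigr) by plain ones.

    Only round parentheses are handled. A \\left( paired with a non-round
    \\right delimiter would be left unbalanced, so real input needs a check
    for such mixed pairs before this rewrite is applied.
    """
    tex = _OPEN.sub("(", tex)
    return _CLOSE.sub(")", tex)
```

After this step, \Gamma\left(z\right) and \Gamma\bigl(z\bigr) both read
\Gamma(z), so a single search pattern is enough for the later macro
replacement.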


Historically, the need for formal consistency has driven mathematicians
to adopt consistent and unique notations [2]. This is extremely beneficial
in the long run. We have interacted on a regular basis with the authors of the
KLS and KLSadd datasets. They agree that our assumptions about consistent
notations are correct and they are considering using our semantic LaTeX macros
in future volumes. Certainly the benefit of using these macros in communicating
with different computer systems is clear.

Once semantic macros are incorporated, the next task is to identify formula
metadata. Formula metadata can be identified within and must be associated
with formulae. One must then identify semantic information for the formula
within the surrounding text to produce formula annotations which describe this
semantic information. These annotations can be summarized as constraints,
substitutions, proofs and formula names if available, as well as related
notes. The automated extraction of formula metadata is a challenging aspect
of the seeding project, and future computer implementations might use machine
learning methods to achieve this goal. However, we have already built automated
algorithms to extract formula metadata. We have, for instance, identified
substitutions by associating definitions for algebraic or OPSF functions which
are utilized in surrounding formulae. The automation process continues by
merging these substitution formulae as annotations in the original formulae
which use them. Another extraction algorithm we have developed identifies
related variables, determines their dependencies and merges the corresponding
annotations with the pre-existing formula metadata. We have manually reviewed
the printed mathematics to identify formula metadata. After we have exhausted
our current rule-based approach for extracting the formula annotations, we will
manually insert the missing identified annotations into the LaTeX source. This
will then be followed by careful checking and expert editorial review. This
also evaluates the quality of our rule-based approach and creates a
gold standard for future programs.
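
The substitution-merging step described above can be sketched as follows. This
is a simplified illustration under stated assumptions: the Formula record, the
rule that a formula of the form "symbol = expression" counts as a definition,
and the function name merge_substitutions are all hypothetical and chosen only
for this example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Formula:
    label: str
    tex: str
    substitutions: list = field(default_factory=list)


# Assumed rule: "<single symbol> = <expression>" is treated as a definition.
_DEFINITION = re.compile(r"^\s*(\\[A-Za-z]+|[A-Za-z])\s*=\s*(.+)$")


def merge_substitutions(formulae):
    """Attach each definition as an annotation to the formulae that use it."""
    definitions = {}
    for f in formulae:
        m = _DEFINITION.match(f.tex)
        if m:
            definitions[m.group(1)] = f.tex.strip()

    for f in formulae:
        for symbol, definition in definitions.items():
            if f.tex.strip() == definition:
                continue  # do not annotate a definition with itself
            # Match the symbol as a whole token, not inside a longer name.
            token = r"(?<![A-Za-z\\])" + re.escape(symbol) + r"(?![A-Za-z])"
            if re.search(token, f.tex):
                f.substitutions.append(definition)
    return formulae
```

Given the two entries x = \cos\theta and T_n(x) = \cos(n\theta), the sketch
records the first as a substitution annotation of the second and leaves the
definition itself unannotated.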

Once the formula metadata has been completely extracted from the text,
the remainder of the text should be removed, leaving a list of LaTeX
formulae with associated metadata. From this semantic source (at the current
stage of our project), we generate Wikitext. One of the features of the
generated Wikitext is that we use a glossary that we have developed
of our DLMF and DRMF macros to identify semantic macros within a formula
and its associated metadata. Presentation and meaningful content MathML is
generated from the DLMF and DRMF macros using a customized LaTeXML server
(http://gwl25.iu.xsede.org) hosted by the XSEDE project that includes all
generated semantic macros. From this glossary, we generate symbols lists for each
formula which uses recognized symbols. The generated Wikitext is converted to
the MediaWiki XML-Dump format, which is then bulk imported to our wiki
instance. Our DRMF Wiki has been optimized for MathML output. Because
we are using Mathoid to render mathematical expressions [14], browsers
without MathML support can display DRMF formulae within MediaWiki. However,
some MathML-related features (such as copying parts of the MathML output)
are not available on these browsers.
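
The glossary lookup that produces a symbols list can be sketched as follows.
The glossary entries, their descriptions, and the function name symbols_list
are placeholders assumed for this example; the real glossary of DLMF and DRMF
macros and the generated Wikitext markup are not reproduced here.

```python
import re

# Hypothetical glossary excerpt: macro name -> human-readable description.
GLOSSARY = {
    "EulerGamma": "Euler's gamma function",
    "Jacobi": "Jacobi polynomial",
    "littleqLaguerre": "little q-Laguerre/Wall polynomial",
    "sin": "sine function",
}

_MACRO = re.compile(r"\\([A-Za-z]+)")


def symbols_list(semantic_tex: str) -> list:
    """Return (macro, description) pairs for recognized macros, in order of
    first appearance and without duplicates."""
    seen = []
    for name in _MACRO.findall(semantic_tex):
        if name in GLOSSARY and name not in seen:
            seen.append(name)
    return [(name, GLOSSARY[name]) for name in seen]
```

Macros that are absent from the glossary, such as \alpha in this excerpt, are
simply skipped, which is why a formula with no recognized symbols ends up with
an empty symbols list.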

At the moment, there are 1282 KLS and KLSadd Wikitext pages. The current
number of KLS and KLSadd formula home pages is 1219, and 98.6 percent of these
formula home pages have non-empty symbols lists. This percentage will increase
as we continue to merge substitution formulae into associated metadata and as
we continue to expand our macro replacement effort. We have detected 208
substitutions which originally appeared as formulae. We inserted these in an
automated fashion into 515 formulae. The goal of our learning
is to obtain a mostly unambiguous content representation of the mathematical
OPSF formulae which we use.

4 Future outlook 

The next seeding projects which we will focus on are those which correspond to
image and Mathematica inputs (see Table 1). We have been given permission
from Caltech to use the BMP dataset within the DRMF. In the BMP dataset, the
original data source is the printed pages of books. We are currently
collaborating on the development of mathematical optical character recognition
(OCR) software [15] for use in this project. We plan to utilize this math OCR
software to generate LaTeX output, into which the DLMF and DRMF semantic macros
will be incorporated using our macro replacement software.

We have already begun work on our next source, namely the incorporation of
the Wolfram eCF dataset into the DRMF. We have been furnished the Mathematica
source (also known as Wolfram language) for this dataset and we are currently
developing software which translates in both directions between the Wolfram
language and our semantic LaTeX source with DRMF and DLMF macros incorporated
(cf. Table 1).

For the DLMF source, thanks to the hard work of the DLMF team over more
than the past ten years, we already have semantic macros implemented, and all
that remains is to extract the metadata from the text associated with formulae,
remove the text after the content has been transferred, convert formulae
information in tables to lists of distinct formulae, and generate formula home
pages. We have mostly achieved this for DLMF Chapter 25 on the Riemann
Zeta function and are currently at work on Chapters 5 (gamma function),
15 (hypergeometric function), 16 (generalized hypergeometric functions), 17
(q-hypergeometric and related functions) and 18 (orthogonal polynomials), which
will ultimately be merged with the KLS and KLSadd datasets. Then we will
continue to the remainder of the DLMF chapters.

Once semantic information has been inserted into the LaTeX source, there are
a huge number of possibilities for how this information can be used. Given that
our datasets are collections of OPSF formulae, we plan on taking advantage
of the incorporated semantic information as an exploratory tool for symbolic
and numerical experiments. For instance, one may use this semantic content to
translate to computer algebra system (CAS) languages such as those
used by Mathematica, Maple or Sage. One could then use the translated formulae
while taking advantage of any of the features available in those software packages.
We should also mention that the DRMF seeding projects generate real content
MathML. The lack of such content MathML has been a huge problem for Mathematics
Information Retrieval research for many years [9, 12]. One major contribution
of the DRMF seeding projects is that they offer quite reasonable content MathML.
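
To make the CAS translation idea concrete, the following sketch maps two of
the semantic macros mentioned earlier onto Mathematica syntax. It is an
illustration only and not the translator under development: the mapping table
covers just \sin and \EulerGamma (using the Mathematica heads Sin and Gamma),
the function name to_mathematica is hypothetical, and macros with parameters
before the @ symbol are not handled.

```python
import re

# Assumed mapping from semantic macro name to Mathematica head.
MATHEMATICA_HEADS = {"sin": "Sin", "EulerGamma": "Gamma"}

# A macro call with one brace-free argument, e.g. \sin@@{z} or \EulerGamma@{z}.
_CALL = re.compile(r"\\([A-Za-z]+)@{1,2}\{([^{}]*)\}")


def to_mathematica(semantic_tex: str) -> str:
    """Translate simple one-argument semantic macro calls to Mathematica."""

    def translate(m):
        head = MATHEMATICA_HEADS.get(m.group(1))
        if head is None:
            return m.group(0)  # unknown macro: leave the source untouched
        return "%s[%s]" % (head, m.group(2))

    # Innermost calls are rewritten first; repeat until nothing changes so
    # that nested calls are translated from the inside out.
    previous = None
    while previous != semantic_tex:
        previous = semantic_tex
        semantic_tex = _CALL.sub(translate, semantic_tex)
    return semantic_tex
```

Because the semantic macro names the function unambiguously, no guessing about
the meaning of \Gamma or of juxtaposed symbols is needed at this stage, which
is the main advantage over translating presentation LaTeX directly.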

From a methodological point of view, we are going to develop evaluation
metrics that measure the degree of semantic formula enrichment. These should
be able to evaluate new approaches such as mathematical language processing
[13] and/or machine learning approaches based on the created gold standard.
Additionally, we are considering the use of sTeX [8], in order to simplify the
definition of new macros. Eventually, we can also develop a heuristic which suggests
new semantic macros based on statistical analysis.